Again, first of all, let's read some data:
In [ ]:
# first, the imports
import os
import datetime as dt
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display
np.random.seed(19760812)
%matplotlib inline
In [ ]:
# We read the data in the file 'mast.txt'
ipath = os.path.join('Datos', 'mast.txt')

def dateparse(date, time):
    # date comes as 'YYMMDD' and time as 'hhmm'
    YY = 2000 + int(date[:2])
    MM = int(date[2:4])
    DD = int(date[4:])
    hh = int(time[:2])
    mm = int(time[2:])
    return dt.datetime(YY, MM, DD, hh, mm, 0)

cols = ['Date', 'time', 'wspd', 'wspd_max', 'wdir',
        'x1', 'x2', 'x3', 'x4', 'x5',
        'wspd_std']
wind = pd.read_csv(ipath, sep = r"\s+", names = cols,
                   parse_dates = [[0, 1]], index_col = 0,
                   date_parser = dateparse)
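As a side note, the 'YYMMDD' and 'hhmm' layout that dateparse decodes by hand can also be written as a format string. The following is a minimal sketch, assuming that layout; the sample strings are made up for illustration and are not taken from 'mast.txt'.

```python
import datetime as dt
import pandas as pd

# Hypothetical sample values following the 'YYMMDD' and 'hhmm' layout
dates = pd.Series(['130904', '130904', '131231'])
times = pd.Series(['0000', '0010', '2350'])

# Join both fields and let pandas parse them with an explicit format
stamps = pd.to_datetime(dates + ' ' + times, format='%y%m%d %H%M')
print(stamps)
```

An explicit format avoids any guessing about which digits are the day and which are the month.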
We can access the elements using indexing, as if it were a numpy array or as we do in plain Python:
In [ ]:
wind[0:10]
In [ ]:
wind['2013-09-04 00:00:00':'2013-09-04 01:30:00']
In this second example the indexing is done with strings, which are the representation of the index labels. Note also that, in this case, the last element of the slice IS INCLUDED.
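The difference between both kinds of slice is easier to see on a tiny series. This is a minimal sketch on synthetic data (the values and the names ts, by_position and by_label are made up for illustration):

```python
import numpy as np
import pandas as pd

# Synthetic 10-minute series (made-up values, not the 'mast.txt' data)
idx = pd.date_range('2013-09-04 00:00', periods=12, freq='10min')
ts = pd.Series(np.arange(12.0), index=idx)

by_position = ts[0:3]  # positions 0, 1 and 2: the end is excluded
by_label = ts['2013-09-04 00:00':'2013-09-04 00:30']  # the end label is included

print(len(by_position))
print(len(by_label))
```

The positional slice returns three elements while the label slice returns four, because the row labelled 00:30 is part of the result.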
In previous examples we have also seen that we can select a column using its name:
In [ ]:
wind['wspd'].head(3)
Depending on how the column names are defined, we can also access the column values using dot notation. This does not always work, so I strongly recommend not using it:
In [ ]:
# This is similar to what we did in the previous code cell
wind.wspd.head(3)
In [ ]:
# An example that can raise an error
df1 = pd.DataFrame(np.random.randn(5,2), columns = [1, 2])
df1
In [ ]:
# This will be wrong
df1.1
In [ ]:
# To access the column we have to use brackets
df1[1]
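Dot notation can also fail without raising any error, which is arguably worse. A minimal sketch, assuming a hypothetical frame df2 with a column whose name matches a DataFrame method:

```python
import pandas as pd

# Hypothetical frame: one column name collides with the DataFrame.mean method
df2 = pd.DataFrame({'mean': [1.0, 2.0, 3.0], 'wspd': [5.0, 6.0, 7.0]})

print(type(df2.mean))                # a bound method, not the column
print(df2['mean'])                   # brackets always return the column
print(df2.wspd.equals(df2['wspd']))  # no clash here, both forms agree
```

Here df2.mean silently refers to the method, so only the bracket form gives access to the column.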
You can also use fancy indexing with Series, as if we were indexing with a list or a boolean array:
In [ ]:
# Create a Series
wspd = wind['wspd']
# Access the elements located at positions 0, 100 and 1000
print(wspd[[0, 100, 1000]])
print('\n' * 3)
# Using indexes at locations 0, 100 and 1000
idx = wspd[[0, 100, 1000]].index
print(idx)
print('\n' * 3)
# We access the same elements as before but using the labels instead of
# the location of the elements
print(wspd[idx])
With DataFrames the fancy indexing can be ambiguous and it will raise an IndexError.
In [ ]:
# Try it...
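A minimal sketch of that ambiguity on a small synthetic frame (df3 and its values are made up). The exact exception type may depend on the pandas version, so the example catches both KeyError and IndexError and prints whichever was raised:

```python
import numpy as np
import pandas as pd

# Small synthetic frame (made-up values)
df3 = pd.DataFrame(np.arange(12).reshape(4, 3), columns=['a', 'b', 'c'])

try:
    df3[[0, 2]]  # the list is looked up among the column labels
    raised = None
except (KeyError, IndexError) as err:
    raised = type(err).__name__
print(raised)

# Being explicit removes the ambiguity
print(df3.iloc[[0, 2]])   # rows at positions 0 and 2
print(df3[['a', 'c']])    # columns labelled 'a' and 'c'
```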
Like with numpy, we can access values using boolean indexing:
In [ ]:
idx = wind['wspd'] > 35
wind[idx]
We can use several conditions. For instance, let's refine the previous result:
In [ ]:
idx = (wind['wspd'] > 35) & (wind['wdir'] > 225)
wind[idx]
Using conditions can be less readable. Since version 0.13 you can use the query method to make the expression more readable.
In [ ]:
# To make it more efficient you should install 'numexpr',
# which is the default engine. If you don't have it installed
# and you don't set the engine to 'python' you will get an ImportError
wind.query('wspd > 35 and wdir > 225', engine = 'python')
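To check that both styles select the same rows, here is a minimal sketch on synthetic data. The frame df4 reuses the 'wspd' and 'wdir' column names, but its values are random and made up:

```python
import numpy as np
import pandas as pd

# Synthetic data reusing the 'wspd' and 'wdir' column names (values are made up)
rng = np.random.RandomState(0)
df4 = pd.DataFrame({'wspd': rng.uniform(0, 50, 200),
                    'wdir': rng.uniform(0, 360, 200)})

# Boolean mask: each condition needs its own parentheses
mask = (df4['wspd'] > 35) & (df4['wdir'] > 225)
by_mask = df4[mask]

# The same selection written as a query string
by_query = df4.query('wspd > 35 and wdir > 225', engine='python')

print(by_mask.equals(by_query))
```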
These ways of selecting can be ambiguous in some cases. Let's make a parenthetical remark and come back later to see more advanced ways of selection.
In [ ]:
s1 = pd.Series(np.arange(0,10), index = np.arange(0,10))
s2 = pd.Series(np.arange(10,20), index = np.arange(5,15))
print(s1)
print(s2)
Now, if we perform an operation between both Series, the operation is carried out where a label exists in both of them. Where a label exists on only one side, it is kept in the result but the operation cannot be performed, so a NaN is returned instead of an error:
In [ ]:
s1 + s2
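If the NaN values are not wanted, the method form of the operator accepts a fill_value. A minimal sketch with the same two Series (the names total and filled are introduced here for illustration):

```python
import numpy as np
import pandas as pd

s1 = pd.Series(np.arange(0, 10), index=np.arange(0, 10))
s2 = pd.Series(np.arange(10, 20), index=np.arange(5, 15))

total = s1 + s2
print(total.isnull().sum())  # labels present in only one Series give NaN

# Series.add with fill_value uses that value for the missing side
filled = s1.add(s2, fill_value=0)
print(filled)
```

Whether 0 is a sensible replacement depends on the data; for measurements it is often better to keep the NaN.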
One of the basic features of pandas is the labeling of the row and column indexes, which can make indexing more complex than in numpy. We have to distinguish between:
Indexing in Series is simpler, as the labels refer to row labels (indexes) since there is only one column. As we have been learning in a vague manner, for a DataFrame basic indexing selects columns.
To select only a column, as we have seen previously:
In [ ]:
wind['wspd_std']
Or we can select several columns:
In [ ]:
wind[['wspd', 'wspd_std']]
But with slicing we will access the rows (the index):
In [ ]:
wind['2015/01/01 00:00':'2015/01/01 02:00']
So the following will raise an error:
In [ ]:
wind['wspd':'wdir']
In [ ]:
wind[['wspd':'wdir']]
Uh, what a mess!!
We have several methods available to index a pandas data structure:
loc: it is used when we index with the row and column labels (it also accepts boolean arrays).
iloc: it is based on element positions (as if it were a numpy array).
ix: it is a combination of both previous methods. Note that ix was deprecated in pandas 0.20 and removed in pandas 1.0.
These methods are also available in Series, but there they are not so useful as indexing is not ambiguous.
Let's see how these methods work in a DataFrame...
Select the first three items in columns 'wspd' and 'wspd_max':
In [ ]:
wind.loc['2013-09-04 00:00:00':'2013-09-04 00:20:00', 'wspd':'wspd_max']
In [ ]:
wind.iloc[0:3, 0:2] # similar to indexing a numpy array: wind.values[0:3, 0:2]
In [ ]:
wind.ix[0:3, 'wspd':'wspd_max']
A fourth way not seen before would be:
In [ ]:
wind[0:3][['wspd', 'wspd_max']]
In [ ]:
wind[['wspd', 'wspd_max']][0:3]
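The equivalence between the label-based and the position-based selections can be checked on a small synthetic frame. This is a minimal sketch (df5 and its values are made up). Since ix is not available in pandas 1.0 and later, the last lines show one possible way of mixing positions and labels using only loc:

```python
import numpy as np
import pandas as pd

# Synthetic frame reusing some column names from above (values are made up)
idx = pd.date_range('2013-09-04 00:00', periods=6, freq='10min')
df5 = pd.DataFrame(np.arange(18.0).reshape(6, 3),
                   index=idx, columns=['wspd', 'wspd_max', 'wdir'])

# Labels: both ends of each slice are included
by_label = df5.loc['2013-09-04 00:00':'2013-09-04 00:20', 'wspd':'wspd_max']
# Positions: the end of each slice is excluded
by_position = df5.iloc[0:3, 0:2]
print(by_label.equals(by_position))

# Positions for the rows and labels for the columns, without ix:
# translate the positions into labels first
mixed = df5.loc[df5.index[0:3], ['wspd', 'wspd_max']]
print(mixed.equals(by_position))
```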
Return all the January 2014 values
Compute the mean wind speed during February 2014
Use the query method to obtain all the wind speeds coming from the North (in a range of $\pm$ 10º around North, taking North as 0º) and with a wind speed above 10 m/s
The same as before but using a boolean array
All the previous problems can be solved using loc, iloc and/or ix. Practice all the possibilities.
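As a possible starting point, here is a hedged sketch on synthetic data rather than on the wind DataFrame (df6 and its random values are made up). It relies on partial string indexing, where a string such as '2014-01' selects a whole month of a datetime index:

```python
import numpy as np
import pandas as pd

# Synthetic hourly data for early 2014 (made-up values, not 'mast.txt')
idx = pd.date_range('2014-01-01', '2014-03-01', freq='60min')
rng = np.random.RandomState(1)
df6 = pd.DataFrame({'wspd': rng.uniform(0, 25, len(idx)),
                    'wdir': rng.uniform(0, 360, len(idx))}, index=idx)

january = df6.loc['2014-01']                  # partial string: a whole month
feb_mean = df6.loc['2014-02', 'wspd'].mean()  # one column of one month

# The North sector spans both ends of the 0-360 range,
# so the two direction conditions are joined with 'or'
north = df6.query('(wdir >= 350 or wdir <= 10) and wspd > 10', engine='python')

print(len(january), feb_mean, len(north))
```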
In [ ]:
In [ ]:
wind.between_time('00:00', '00:30').head(20)
In [ ]:
# It also works with series:
wind['wspd'].between_time('00:00', '00:30').head(20)
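To see which rows between_time keeps, here is a minimal sketch on two days of synthetic 10-minute data (ts2 and its values are made up). With the default arguments both ends of the time window are included:

```python
import numpy as np
import pandas as pd

# Two days of synthetic 10-minute data (made-up values)
idx = pd.date_range('2013-09-04', periods=2 * 144, freq='10min')
ts2 = pd.Series(np.arange(len(idx), dtype=float), index=idx)

# 00:00, 00:10, 00:20 and 00:30 of each day
early = ts2.between_time('00:00', '00:30')
print(early)

# When the start is later than the end, the window wraps around midnight
night = ts2.between_time('23:30', '00:10')
print(night)
```

The selection only looks at the time of day, so the same window is returned for every date in the index.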